A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information
Authors
Abstract
We present a novel focused crawling method for extracting and processing cultural data from the web in a fully automated fashion. After downloading the pages, we extract from each document a set of words for each thematic cultural area and build multidimensional document vectors from the most frequent word occurrences. The dissimilarity between these vectors is measured by the Hamming distance. In the last stage, we employ cluster analysis to partition the document vectors into clusters. Our approach is illustrated via a proof-of-concept application that processes hundreds of web pages spanning different cultural thematic areas.
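As a rough illustration of the pipeline described in the abstract, the sketch below builds binary presence vectors over the most frequent words, measures pairwise dissimilarity with the Hamming distance, and partitions the documents with hierarchical cluster analysis. This is not the authors' implementation: the binary encoding, vocabulary size, average-linkage clustering, and all function names are illustrative assumptions, since the abstract does not specify these details.

```python
# Minimal sketch (assumed details, not the paper's implementation):
# binary word-presence vectors -> Hamming distances -> hierarchical clustering.
from collections import Counter

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster


def build_vectors(tokenized_docs, vocab_size=200):
    """Binary vectors over the vocab_size most frequent words across all documents."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    vocab = [word for word, _ in counts.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}
    vectors = np.zeros((len(tokenized_docs), len(vocab)), dtype=int)
    for row, doc in enumerate(tokenized_docs):
        for word in set(doc):
            if word in index:
                vectors[row, index[word]] = 1
    return vectors, vocab


def cluster_documents(vectors, n_clusters=3):
    """Hamming dissimilarity plus average-linkage agglomerative clustering."""
    distances = pdist(vectors, metric="hamming")  # fraction of differing positions
    tree = linkage(distances, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy usage with three tokenized "documents" from different cultural areas.
docs = [["museum", "exhibition", "painting"],
        ["museum", "gallery", "painting"],
        ["theatre", "drama", "stage"]]
vectors, vocab = build_vectors(docs, vocab_size=10)
print(cluster_documents(vectors, n_clusters=2))
```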
Similar resources
A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections
The Web, containing a large amount of useful information and resources, is expanding rapidly. Collecting domain-specific documents/information from the Web is one of the most important methods to build digital libraries for the scientific community. Focused Crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or di...
An Effective fuzzy Clustering Algorithm for Web Document Classification: a Case Study in Cultural Content Mining
This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a ‘focused’ web crawler to download web documents relevant to culture. The term ‘focused crawler’ refers to web crawlers that search and process only those web pages that are relevant to a particular topic. After downloa...
Evaluation of a Graph-based Topical Crawler
Topical (or, focused) crawlers have become important tools in dealing with the massiveness and dynamic nature of the World Wide Web. Guided by a data mining component that monitors and analyzes the boundary of the set of crawled pages, a focused crawler selectively seeks out pages on a pre-defined topic. Recent research indicates that both the textual content of web pages and the structural inf...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Improving the performance of focused web crawlers
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...
Journal title:
Volume, Issue:
Pages: -
Publication date: 2008